Practice Set - Week 2

Author

SOPHAS

Published

24-Apr-2025

using StatsBase

Q1. Fifty students took a class test. The marks of the students who passed are given below:

Marks    No. of students
40       8
50       10
60       9
70       6
80       4
90       3

If the mean marks for all the students was 51.6, find the mean marks of the students who failed.

Answer

# Given data
marks = [40, 50, 60, 70, 80, 90]
students = [8, 10, 9, 6, 4, 3]
total_students = 50 # Total number of students
total_mean = 51.6 # Mean marks for all students

# Total marks of all students
marks_total = total_mean * total_students

# Total marks of passed students
marks_passed = sum(marks .* students)

# Total marks of failed students
marks_failed = marks_total - marks_passed

# Mean marks of students who failed
passed_students = sum(students) # Number of passed students
failed_students = total_students - passed_students # Number of failed students

mean_failed = marks_failed / failed_students

println("The mean marks of the students who failed is $mean_failed.")
The mean marks of the students who failed is 21.0.

Q2. The average dividend declared by a group of 10 pharma companies was 18 per cent. Later on, it was discovered that one correct figure, 12, was misread as 22. Find the correct average dividend.

Answer

# Given data
incorrect_avg = 18
n_companies = 10

# Incorrect and correct figure
incorrect_fig = 22
correct_fig = 12

# Calculate incorrect total dividend
incorrect_total = incorrect_avg * n_companies

# Adjust total for the corrected figures
correct_total = incorrect_total - incorrect_fig + correct_fig

# Calculate corrected average dividend
correct_avg = correct_total / n_companies

println("The correct average dividend is $correct_avg %.")
The correct average dividend is 17.0 %.

Q3. The mean of 200 observations was 50. Later on, it was found that two observations were misread as 92 and 8 instead of 192 and 88. Find the correct mean.

Answer

# Given data
incorrect_mean = 50
n = 200

# Incorrect and correct observations
incorrect_obs = [92, 8]
correct_obs = [192, 88]

# Calculate incorrect total sum
incorrect_total = incorrect_mean * n

# Adjust total for the corrected observations
correct_total = incorrect_total - sum(incorrect_obs) + sum(correct_obs)

# Calculate corrected mean
correct_mean = correct_total / n

println("The corrected mean is $correct_mean.")
The corrected mean is 50.9.
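
Q2 and Q3 follow the same pattern: recover the reported total, swap the misread values for the correct ones, and divide by the number of observations again. A minimal sketch of that pattern is given below; the function name corrected_mean and its argument names are hypothetical and are not part of the original worksheet.

```julia
# Hypothetical helper (not from the worksheet): corrects a mean after
# some observations were recorded wrongly.
function corrected_mean(reported_mean, n, wrong, right)
    reported_total = reported_mean * n                          # total implied by the reported mean
    corrected_total = reported_total - sum(wrong) + sum(right)  # swap wrong values for right ones
    return corrected_total / n
end

q2 = corrected_mean(18, 10, [22], [12])           # Q2: one misread figure
q3 = corrected_mean(50, 200, [92, 8], [192, 88])  # Q3: two misread observations

println("Q2: $q2, Q3: $q3")
```

Passing the misread and correct values as vectors lets the same function handle one wrong figure or several.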

Q4. Let \(x_1\) and \(x_2\) be the arithmetic means of two sets of data of the same nature, of sizes \(n_1\) and \(n_2\) respectively. Show that their combined A.M. can be calculated as

\[ \frac{(n_1 \times x_1) + (n_2 \times x_2)}{(n_1 + n_2)} \]

Answer

  1. Definition of arithmetic mean:

\[ x = \frac{\text{sum of data points}}{\text{number of data points}} \]

  2. Define the sums for each data set:

\[ \text{Sum of data points in set 1} = n_1 \times x_1 \]

\[ \text{Sum of data points in set 2} = n_2 \times x_2 \]

  3. Calculate the combined sum and total number of data points:

\[ \text{Total sum of data points} = (n_1 \times x_1) + (n_2 \times x_2) \]

\[ \text{Total number of data points} = n_1 + n_2 \]

  4. Apply the definition of arithmetic mean to the combined data:

\[ x = \frac{\text{Total sum of data points}}{\text{Total number of data points}} \]

\[ x = \frac{(n_1 \times x_1) + (n_2 \times x_2)}{(n_1 + n_2)} \]
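
The result can be checked numerically. The sketch below compares the combined-mean formula with the mean of the pooled data; the two data sets are made up purely for illustration, and mean is taken from Julia's Statistics standard library.

```julia
using Statistics  # standard library; provides mean

# Two made-up data sets, used only to illustrate the formula
set1 = [2, 4, 6]
set2 = [10, 20, 30, 40]

n1, n2 = length(set1), length(set2)
x1, x2 = mean(set1), mean(set2)

# Combined mean from the formula versus the mean of the pooled data
combined_formula = (n1 * x1 + n2 * x2) / (n1 + n2)
combined_direct = mean(vcat(set1, set2))

println("Formula: $combined_formula, direct: $combined_direct")
```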


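Q5. The average marks of section A of class 9 is 70 and the average marks of section B of class 9 is 80. Find the average marks of class 9, assuming (a) both sections have the same number of students, and (b) sections A and B have 55 and 45 students respectively.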

Answer

# Given data
a_avg = 70
b_avg = 80

# Calculate total average
total_avg1 = (a_avg + b_avg) / 2

println("The average marks of class 9, assuming both sections have equal students is $total_avg1.")
The average marks of class 9, assuming both sections have equal students is 75.0.

# Given data
a_students = 55 #n1
a_avg = 70 #x1
b_students = 45 #n2
b_avg = 80 #x2

# Calculate total average: ((n1 * x1) + (n2 * x2)) / (n1 + n2)
total_avg2 = ((a_students * a_avg ) + (b_students * b_avg)) / (a_students + b_students)

println("The average marks of class 9, with 55 and 45 students in Sections A and B, is $total_avg2.")
The average marks of class 9, with 55 and 45 students in Sections A and B, is 74.5.

Q6. \(x_1\), \(x_2\), \(…\), \(x_n\) are \(n\) values. We are to calculate the sum of squared deviations around a number \(a\) as follows:

\[ SSD = (x_1 - a)^2 + (x_2 - a)^2 + … + (x_n - a)^2 \]

Show that \(SSD\) is minimum when \(a\) is the mean of the given values.

Answer

  1. Expression for SSD: \[ \text{SSD} = \sum_{i=1}^{n} (x_i - a)^2 \]

  2. Expand squared terms: \[ \text{SSD} = \sum_{i=1}^{n} (x_i^2 -2ax_i + a^2) \]

  3. Separate the summation: \[ \text{SSD} = \sum_{i=1}^{n} x_i^2 -2a \sum_{i=1}^{n} x_i + na^2 \]

  4. Minimize with respect to \(a\) by differentiating and setting the derivative to zero:

\[ \frac{d(\text{SSD})}{da} = -2 \sum_{i=1}^{n} x_i + 2na \]

\[ -2 \sum_{i=1}^{n} x_i + 2na = 0 \]

  5. Solve for \(a\):

\[ 2na = 2 \sum_{i=1}^{n} x_i \]

\[ a = \frac{\sum_{i=1}^{n} x_i}{n} \]

Because \(\frac{d^2(\text{SSD})}{da^2} = 2n > 0\), this stationary point is a minimum, so the SSD is minimized when \(a\) is the mean of the values:

\[ a = \bar x = \frac{\sum_{i=1}^{n} x_i}{n} \]

This makes sense intuitively because the mean is the value that balances the deviations on both sides, minimizing the squared differences.
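
The same conclusion can be reached without calculus. As a sketch, write \(x_i - a = (x_i - \bar x) + (\bar x - a)\) and expand the square:

\[ \sum_{i=1}^{n} (x_i - a)^2 = \sum_{i=1}^{n} (x_i - \bar x)^2 + 2(\bar x - a) \sum_{i=1}^{n} (x_i - \bar x) + n(\bar x - a)^2 \]

Since \(\sum_{i=1}^{n} (x_i - \bar x) = 0\), the middle term vanishes:

\[ \text{SSD} = \sum_{i=1}^{n} (x_i - \bar x)^2 + n(\bar x - a)^2 \]

The first term does not depend on \(a\), and the second term is non-negative and equals zero only when \(a = \bar x\). For the values 1, 2, 3, 4, 5 (mean 3) and \(a = 1\), this gives \(10 + 5(3 - 1)^2 = 30\).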

For example, consider the data (1, 2, 3, 4, 5) and calculate the SSD for \(a\) = mean and for another value, say \(a\) = 1.

# Create a function for SSD
ssd(vec::Vector{<:Number}, val::Number) = sum((vec .- val).^2)

# Given data
data = [1, 2, 3, 4, 5]

# Calculate mean
mean_data = mean(data)

# Calculate SSD for a = mean
ssd1 = ssd(data, mean_data)

println("ssd(data, mean_data) = $ssd1")

# Calculate SSD for a = 1
ssd2 = ssd(data, 1)

println("ssd(data, 1) = $ssd2")
ssd(data, mean_data) = 10.0
ssd(data, 1) = 30

To compute the SSD, we define the function in assignment (one-line) form: the function name and its typed arguments go on the left of the equals sign, and the expression that calculates the SSD goes on the right.

Type annotations are optional, but they document the expected inputs and restrict the arguments the function accepts.

Number is the abstract supertype of all numeric types, including integers and floats, while Vector{<:Number} matches a vector whose elements are of any Number subtype.


Q7. The median is not affected by extreme values. Give examples in support of your answer.

Answer

The median represents the middle value of a dataset when it is arranged in ascending or descending order. Since it is based on the position of the middle value(s) rather than the actual numerical values, it is not significantly affected by extreme values (outliers).

# Example 
a = [10, 12, 14, 16, 18] # Dataset a
median_a = median(a)

b = [10, 12, 14, 16, 100] # Dataset b with an extreme value
median_b = median(b)

print("The medians of datasets a and b are $median_a and $median_b respectively.")
The medians of datasets a and b are 14.0 and 14.0 respectively.

The extreme value (100) does not change the median, showing its resistance to outliers.
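
For contrast, the mean of the same two datasets can be computed alongside the median. The sketch below reuses datasets a and b from the example and assumes mean and median from Julia's Statistics standard library; the variable names mean_a and mean_b are introduced here for illustration.

```julia
using Statistics  # standard library; provides mean and median

a = [10, 12, 14, 16, 18]   # dataset a from the example
b = [10, 12, 14, 16, 100]  # dataset b with an extreme value

mean_a, mean_b = mean(a), mean(b)
median_a, median_b = median(a), median(b)

println("Means: $mean_a and $mean_b")
println("Medians: $median_a and $median_b")
```

Replacing 18 with 100 raises the mean from 14.0 to 30.4, while the median stays at 14.0.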


Q8. Somnath does data analysis for a company. The data on tea consumption (in grams) for the company are as follows:

14.77 16.11 16.11 15.05 15.99 14.91 15.27 16.01 15.75 14.89 16.05 15.22 16.02 15.24 16.11 15.02

Answer

# Given data
data = [14.77 16.11 16.11 15.05 15.99 14.91 15.27 16.01 15.75 14.89 16.05 15.22 16.02 15.24 16.11 15.02]

# Calculate mean and median
mean_data = mean(data)
median_data = median(data)

print("The mean and median for the data on tea consumption are $(mean_data)g and $(median_data)g respectively.")
The mean and median for the data on tea consumption are 15.5325g and 15.51g respectively.

The mean (15.53 grams) and median (15.51 grams) are very close, which suggests little skewness, so both measures are suitable. Since the dataset has no extreme values, the mean may be preferred for further statistical analysis because it incorporates all data values.


Q9. Following are the responses from 55 employees to the question about how much time they travel to reach the office (in mins).

055 060 080 080 080 085 085 085 090 090 090

090 092 094 095 095 095 095 100 100 100 100

100 100 105 105 105 105 109 110 110 110 110

112 115 115 115 115 115 120 120 120 120 120

124 125 125 125 130 130 140 140 140 145 150

Calculate the range and interquartile range and interpret your result.

Answer

# Given data
travel_time = vec([055 060 080 080 080 085 085 085 090 090 090
                   090 092 094 095 095 095 095 100 100 100 100
                   100 100 105 105 105 105 109 110 110 110 110
                   112 115 115 115 115 115 120 120 120 120 120
                   124 125 125 125 130 130 140 140 140 145 150])

# Calculate range 
range_travel_time = maximum(travel_time) - minimum(travel_time)

# Calculate Q1, Q3, and IQR
q1_travel_time = quantile(travel_time, 0.25)
q3_travel_time = quantile(travel_time, 0.75)
iqr_travel_time = iqr(travel_time)

println("The range is: $range_travel_time mins.")
println("The first quartile is: $q1_travel_time mins.")
println("The third quartile is: $q3_travel_time mins.")
println("The IQR is: $iqr_travel_time mins.")
The range is: 95 mins.
The first quartile is: 94.5 mins.
The third quartile is: 120.0 mins.
The IQR is: 25.5 mins.

The range of 95 minutes indicates the spread between the shortest and longest travel times among employees.

The interquartile range (IQR) of 25.5 minutes means that the middle 50% of employees take between 94.5 and 120 minutes to reach the office. Travel times for this central half are therefore fairly concentrated, and unlike the range, the IQR is not influenced by the extreme values at either end (such as 55 or 150 minutes).

The quantile() function is designed for vectors, but if working with matrices, you can convert them to a vector using vec().
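
The fractional quartile values come from linear interpolation between sorted observations. The sketch below shows this, assuming Julia's default quantile definition (keyword parameters alpha = 1 and beta = 1). The helper name interp_quantile and the sample values are hypothetical and serve only as illustration.

```julia
using Statistics  # standard library; provides quantile

# Hypothetical helper: linear-interpolation quantile, assuming the
# default definition used by quantile (alpha = beta = 1).
function interp_quantile(values, p)
    s = sort(vec(values))            # sorted copy as a vector
    h = (length(s) - 1) * p + 1      # fractional 1-based position
    lo = floor(Int, h)
    hi = min(lo + 1, length(s))
    return s[lo] + (h - lo) * (s[hi] - s[lo])
end

sample = [90, 92, 94, 95, 100, 105]  # made-up values for illustration
manual_q1 = interp_quantile(sample, 0.25)
builtin_q1 = quantile(sample, 0.25)

println("Manual: $manual_q1, built-in: $builtin_q1")
```

For the 55 travel times, the first-quartile position is 54 × 0.25 + 1 = 14.5, halfway between the 14th (94) and 15th (95) sorted values, which gives 94.5.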


Q10. An agriculture farm sells grab bags of flower bulbs. The bags are sold by weight; thus the number of bulbs in each bag can vary depending on the varieties included.

Below are the number of bulbs in each of the 20 bags sampled:

21 33 37 56 47 25 33 32 47 34

36 23 26 33 37 26 37 37 43 45

Answer

# Given data
bulbs = [21 33 37 56 47 25 33 32 47 34 36 23 26 33 37 26 37 37 43 45]

# Calculate mean and median
mean_bulbs = mean(bulbs)
median_bulbs = median(bulbs)

println("The mean and median of the number of bulbs per bag are $mean_bulbs and $median_bulbs respectively.")
The mean and median of the number of bulbs per bag are 35.4 and 35.0 respectively.

The mean (35.4) and median (35.0) are very close, suggesting an approximately symmetrical distribution with no strong skewness. Although a few larger values exist (e.g., 56), their effect on the mean is minimal. This is consistent with, but does not by itself establish, a near-normal distribution.


Q11. The wholesale prices of a commodity for a week are as follows:

Day: 1 2 3 4 5 6 7

Commodity price (Rs/kg): 240 260 270 245 255 286 264

Calculate the variance and standard deviation.

Answer

# Given data
com_price = [240 260 270 245 255 286 264] # commodity price in Rs/kg

# Calculate variance and standard deviation
var_com_price = var(com_price)
std_com_price = std(com_price)

println("The variance is: $(round(var_com_price, digits = 2)) Rs²/kg².")
println("The standard deviation is $(round(std_com_price, digits = 2)) Rs/kg.")
The variance is: 240.33 Rs²/kg².
The standard deviation is 15.5 Rs/kg.
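
The figures above are the sample variance and standard deviation, because var and std divide by n - 1 by default. A minimal sketch of the same calculation done by hand is given below; the names prices, deviations, manual_var and manual_std are introduced here for illustration.

```julia
using Statistics  # standard library; provides mean, var and std

prices = [240, 260, 270, 245, 255, 286, 264]  # Rs/kg, from Q11

n = length(prices)
deviations = prices .- mean(prices)           # deviations from the mean
manual_var = sum(deviations .^ 2) / (n - 1)   # sample variance, divisor n - 1
manual_std = sqrt(manual_var)                 # sample standard deviation

println("Manual variance: $(round(manual_var, digits = 2))")
println("Matches var(): $(manual_var ≈ var(prices))")
```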

Q12. (Continued from Q11) Suppose that next week the price per kg on each day is Rs 10 more than in the previous week.

Calculate the variance and standard deviation. What is your conclusion?

Answer

# Given data 
com_price = [240 260 270 245 255 286 264] .+ 10 # Previous week's price + Rs 10.

# Calculate variance and standard deviation
var_com_price = var(com_price)
std_com_price = std(com_price)

println("The variance is: $(round(var_com_price, digits = 2)) Rs²/kg².")
println("The standard deviation is $(round(std_com_price, digits = 2)) Rs/kg.")
The variance is: 240.33 Rs²/kg².
The standard deviation is 15.5 Rs/kg.

The variance and standard deviation remain unchanged when all prices increase by Rs 10/kg because they measure the spread of the data, not its location: adding a constant shifts the mean but leaves the spread unchanged.
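
This can be shown in general. As a sketch, let \(y_i = x_i + c\) for a constant \(c\) (here \(c = 10\)). The mean shifts by the same constant:

\[ \bar y = \frac{\sum_{i=1}^{n} (x_i + c)}{n} = \bar x + c \]

so every deviation from the mean is unchanged:

\[ y_i - \bar y = (x_i + c) - (\bar x + c) = x_i - \bar x \]

and therefore the sample variance is unchanged:

\[ s_y^2 = \frac{\sum_{i=1}^{n} (y_i - \bar y)^2}{n - 1} = \frac{\sum_{i=1}^{n} (x_i - \bar x)^2}{n - 1} = s_x^2 \]

For the price data, the mean moves from 260 to 270 Rs/kg while the variance stays the same.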